Abstract

Self-delimiting (SLIM) programs are a central concept of theoretical computer science, particularly algorithmic information & probability theory, and asymptotically optimal program search (AOPS). To apply AOPS to (possibly recurrent) neural networks (NNs), I introduce SLIM NNs. A typical SLIM NN is a general parallel-sequential computer. Its neurons have threshold activation functions. Its output neurons may affect the environment, which may respond with new inputs. During a computational episode, activations are spreading from input neurons through the SLIM NN until the computation activates a special halt neuron. Weights of the NN's used connections define its program. Halting programs form a prefix code. An episode may never activate most neurons, and hence never even consider their outgoing connections. So we trace only neurons and connections used at least once. With such a trace, the reset of the initial NN state does not cost more than the latest program execution. This by itself may speed up traditional NN implementations. To efficiently change SLIM NN weights based on experience, any learning algorithm (LA) should ignore all unused weights. Since prefixes of SLIM programs influence their suffixes (weight changes occurring early in an episode influence which weights are considered later), SLIM NN LAs should execute weight changes online during activation spreading. This can be achieved by applying AOPS to growing SLIM NNs. Since SLIM NNs select their own task-dependent effective size (= number of used free parameters), they have a built-in way of addressing overfitting, with the potential of effectively becoming small and slim whenever this is beneficial. To efficiently teach a SLIM NN to solve many tasks, such as correctly classifying many different patterns, or solving many different robot control tasks, each connection keeps a list of tasks it is used for. The lists may be efficiently updated during training. To evaluate the overall effect of currently tested weight changes, a SLIM NN LA needs to re-test performance only on the efficiently computable union of tasks potentially affected by the current weight changes. Search spaces of many existing LAs (such as hill climbing and neuro-evolution) can be greatly reduced by obeying restrictions of SLIM NNs. Future SLIM NNs will be implemented on 3-dimensional brain-like multi-processor hardware. Their LAs will minimize task-specific total wire length of used connections, to encourage efficient solutions of subtasks by subsets of neurons that are physically close. The novel class of SLIM NN LAs is currently being probed in ongoing experiments to be reported in separate papers.
1 Traditional NNs / Motivation of SLIM NNs / Outline



Recurrent neural networks (RNNs) are neural networks (NNs) [?] with feedback connections. RNNs are, in principle, as powerful as any traditional computer. There is a trivial way of seeing this [?]: a traditional microprocessor can be modeled as a very sparsely connected RNN consisting of simple neurons implementing nonlinear AND and NAND gates. Compare [69] for a more complex argument. RNNs can learn to solve many tasks involving sequences of continually varying inputs. Examples include robot control, speech recognition, music composition, attentive vision, and numerous others. Section 1.1 will give a brief overview of recent NNs and RNNs that achieved extraordinary success in many applications and competitions.

Although RNNs are general computers whose programs are weight matrices, asymptotically optimal program search (AOPS) [41, 66, 58, 60] has not yet been applied to RNNs. Instead, most RNN learning algorithms are based on more or less heuristic search techniques such as gradient descent or evolution (see Section 1.1). One reason for the current lack of AOPS-based RNNs may be that traditional AOPS variants are designed to search a space of sequential self-delimiting (SLIM) programs [42, 3] (Section 1.2). The concept of partially parallel SLIM NNs will help to adapt AOPS to RNNs.

Section 1.3 will mention additional problems addressed by SLIM NNs: (1) Traditional NN implementations based on matrix multiplications may be inefficient for large NNs where most weights are rarely used (Section 1.3.1). (2) Traditional NNs use ad hoc ways of avoiding overfitting (Section 1.3.2). (3) Traditional RNNs are not well-suited as increasingly general problem solvers to be trained from scratch by PowerPlay [62], which continually invents the easiest-to-add novel computational problem by itself (Section 1.3.3).

Section 2 will describe essential properties of SLIM NNs; Section 3.2 will show how to apply incremental AOPS to SLIM RNNs.

1.1 Brief Intro to Successful RNNs and Related Deep NNs [61]

Supervised RNNs can be trained to map sequences of input patterns to desired output sequences by gradient descent and other methods [?, ?, ?, 52, 44, ?, 75]. Early RNNs had problems with learning to store relevant events in short-term memory across long time lags [26]. Long Short-Term Memory (LSTM) overcame these problems, outperforming early RNNs in many applications [27, 14, ?, 65, 20, 21, 23, 22]. While RNNs used to be toy problem methods in the 1990s, they have recently started to beat all other methods in challenging real-world applications [63, 65, ?, 21, 22, 23]. Recently, CTC-trained [20] multidimensional [23] RNNs won three Connected Handwriting Recognition Competitions at ICDAR 2009 (see below).

Training an RNN by standard methods is as difficult as training a deep feedforward NN (FNN) with many layers [26]. However, recent deep FNNs with special internal architecture overcome these problems to the extent that they are currently winning many international visual pattern recognition contests [63, 5, 7, 8, 10, 9] (see below). None of this requires the traditional sophisticated computer vision techniques developed over the past six decades or so. Instead, those biologically rather plausible NN architectures learn from experience with millions of training examples. Typically they have many non-linear processing stages like Fukushima's Neocognitron [13]; they sometimes (but not always) profit from sparse network connectivity and techniques such as weight sharing & convolution [38, ?], max-pooling [49], and contrast enhancement like the one automatically generated by unsupervised Predictability Minimization [53, 64, 67]. NNs are now often outperforming all other methods including the theoretically less general and less powerful support vector machines (SVMs) based on statistical learning theory [77] (which for a long time had the upper hand, at least in practice). These results are currently contributing to a second Neural Network ReNNaissance (the first one happened in the 1980s and early 90s) which might not be possible without dramatic advances in computational power per Swiss Franc, obtained in the new millennium. In particular, to implement and train NNs, we exploit graphics processing units (GPUs). GPUs are mini-supercomputers normally used for video games, often 100 times faster than traditional CPUs, and a million times faster than PCs of two decades ago when we started this type of research.

Since 2009, my group's NN and RNN methods have achieved many first ranks in international competitions: (7) ISBI 2012 Electron Microscopy Stack Segmentation Challenge (with superhuman pixel error rate) [?]. (6) IJCNN 2011 on-site Traffic Sign Recognition Competition (0.56% error rate, the only method better than humans, who achieved 1.16% on average; 3rd place for 1.69%) [9]. (5) ICDAR 2011 offline Chinese handwritten character recognition competition [10]. (4) Online German Traffic Sign Recognition Contest (1st & 2nd rank; 1.02% error rate) [?]. (3) ICDAR 2009 Arabic Connected Handwriting Competition (won by LSTM RNNs [?], same below). (2) ICDAR 2009 Handwritten Farsi/Arabic Character Recognition Competition. (1) ICDAR 2009 French Connected Handwriting Competition. Additional 1st ranks were achieved in important machine learning (ML) benchmarks since 2010: (A) MNIST handwritten digits data set [38] (perhaps the most famous ML benchmark). New records: 0.35% error in 2010 [?], 0.27% in 2011 [6], first human-competitive performance (0.23%) in 2012 [10]. (B) NORB stereo image data set [39]. New records in 2011, 2012, e.g., [?]. (C) CIFAR-10 image data set [33]. New records in 2011, 2012, e.g., [10].

Reinforcement Learning (RL) [32, 76] is more challenging than supervised learning as above, since there is no teacher providing desired outputs at appropriate time steps. To solve a given problem, the learning agent itself must discover useful output sequences in response to the observations. The traditional approach to RL [76] makes strong assumptions about the environment, such as the Markov assumption: the current input of the agent tells it all it needs to know about the environment. Then all we need to learn is some sort of reactive mapping from stationary inputs to outputs. This is often unrealistic. A more general approach for partially observable environments directly evolves programs for RNNs with internal states (no need for the Markovian assumption), by applying evolutionary algorithms [45, 68, 28] to RNN weight matrices [82, 70, 72, 25]. Recent work brought progress through a focus on reducing search spaces by co-evolving the comparatively small weight vectors of individual neurons and synapses [19], by Natural Gradient-based Stochastic Search Strategies [80, 73, 74, ?, ?, 79], and by reducing search spaces through weight matrix compression [55, 35]. RL RNNs now outperform many previous methods on benchmarks [19], creating memories of important events and solving numerous tasks unsolvable by classical RL methods.

1.2 Principles of Traditional Sequential SLIM Programs 

The RNNs of Section 1.1 are not designed for AOPS. Traditional AOPS favors short and fast programs written in a universal programming language that permits self-delimiting (SLIM) programs [42, 3] studied in the theory of Kolmogorov complexity and algorithmic probability [?, ?, 54, 43, 55, 56, 57, 30]. In fact, SLIM programs are essential for making the theory of algorithmic probability elegant, e.g., [43].

The nice thing about SLIM programs is that they determine their own size during runtime. Traditional 
sequential SLIM programs work as follows: Whenever the instruction pointer of a Turing Machine or 
a traditional PC has been initialized or changed (e.g., through a conditional jump instruction) such that 
its new value points to an address containing some executable instruction, then the instruction will be 
executed. This may change the internal storage including the instruction pointer. Once a halt instruction is 
encountered and executed, the program stops. 

Whenever the instruction pointer points to an address that never was used before by the current program and does not yet contain an instruction, this is interpreted as the online request for a new instruction [42, 3] (typically selected by a time-optimal search algorithm [66, 58, 60]). The new instruction is appended to the growing list of used instructions defining the program so far.

Executed program beginnings or prefixes influence their possible suffixes. Code execution determines 
code size in an online fashion. 

Prefixes that halt or at least cease to request any further input instructions are called self-delimiting programs or simply programs. This procedure yields prefix codes on program space. No halting or non-halting program can be the prefix of another one.
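The online growth of a program and the resulting prefix property can be illustrated with a minimal Python sketch. The three-instruction toy machine (`H`, `I`, `C`) and the function names are hypothetical and serve only as an illustration; they are not the instruction set of any system discussed in this paper. A prefix is extended only when its execution runs past its last instruction, which mimics the online request for a new instruction.

```python
# Toy illustration only: this instruction set is an assumption, not from the paper.
HALT, INC, COND = "H", "I", "C"   # C halts if the counter is even


def run(prefix):
    """Execute a program prefix; report whether it halts or requests more code."""
    counter = 0
    for op in prefix:
        if op == HALT:
            return "halted"
        if op == INC:
            counter += 1
        elif op == COND and counter % 2 == 0:
            return "halted"
    # The instruction pointer left the known prefix: request a new instruction.
    return "needs_more"


def enumerate_halting(max_len):
    """Grow prefixes one instruction at a time, only where execution requests it."""
    halting, frontier = set(), [""]
    while frontier:
        prefix = frontier.pop()
        if run(prefix) == "halted":
            halting.add(prefix)            # a halted program is never extended
        elif len(prefix) < max_len:
            frontier.extend(prefix + op for op in (HALT, INC, COND))
    return halting


def is_prefix_free(programs):
    """True if no program is a proper prefix of another one."""
    return not any(p != q and q.startswith(p) for p in programs for q in programs)
```

Because only prefixes that request more code are ever extended, the set returned by `enumerate_halting` is prefix-free by construction.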

Principles of SLIM programs are not implemented by traditional standard RNNs. SLIM RNNs, however, do implement them, making SLIM RNNs highly compatible with time-optimal program search (Section 3.2).






1.3 Additional Problems of Traditional NNs Addressed by SLIM NNs



1.3.1 Certain Inefficiencies of Traditional NN Implementations 

Typical matrix multiplication-based implementations of the NN algorithms in Section 1.1 always take into consideration all neurons and connections of a given NN, even when most are not even needed to solve a particular task.

The SLIM NNs of the present paper use more efficient ways of information processing and learning. Imagine a large RNN with a trillion connections connecting a billion neurons, each with a thousand outgoing connections to other neurons. If the RNN consists of biologically plausible winner-take-all (WITA) neurons with threshold activation functions [?], also found in networks of spiking neurons [16], a given RNN computation might activate just a tiny fraction of all neurons, and hence never even consider the outgoing connections of most neurons. This simple fact can be exploited to devise classes of NN algorithms that are less costly in various senses, to be detailed below.

1.3.2 Traditional Ad Hoc Ways of Avoiding Overfitting 

To avoid overfitting on training sets and to improve generalization on test sets, various pre-wired regularizer terms [2] have been added to performance measures or objective functions of traditional NNs. The idea is to obtain simple NNs by penalizing NN complexity. One problem is the ad hoc weighting of such additional terms. The present paper's more principled SLIM NNs can learn to actively select in task-specific ways their own size, that is, their effective number of weights (= modifiable free parameters), in line with the theory of algorithmic probability and optimal universal inductive inference [?, 34, 54, 43, 55, 56, 57, 30].

1.3.3 RNNs as Problem Solvers for PowerPlay 

The recent unsupervised PowerPlay framework [62] trains an increasingly general problem solver from scratch, continually inventing the easiest-to-add novel computational problem by itself. We will see that unlike traditional RNNs, SLIM RNNs are well-suited as problem solvers to be trained by PowerPlay. In particular, SLIM RNNs support a natural modularization of the space of self-invented and other tasks and their solutions into more or less independent regions. More on this particular motivation of SLIM NNs can be found in Section 4.

2 Self-Delimiting Parallel-Sequential Programs on SLIM RNNs 

Unless stated, or otherwise obvious, to simplify notation, throughout the paper newly introduced variables are assumed to be integer-valued and to cover the range implicit in the context. $\mathbb{N}$ denotes the natural numbers, $\mathbb{R}$ the real numbers, $\epsilon \in \mathbb{R}$ a positive constant, $m, n, n_0, k, i, j, l, p, q$ non-negative integers.

The $k$-th computational unit or neuron of our RNN is denoted $u^k$ ($0 < k \le n(u) \in \mathbb{N}$). $w^{lk}$ is the real-valued weight on the directed connection $c^{lk}$ from $u^l$ to $u^k$. Like the human brain, the RNN may be sparsely connected, that is, each neuron may be connected to just a fraction of the other neurons. To program the RNN means to set some or all of the weights $\langle w^{lk} \rangle$.

At discrete time step $t = 1, 2, \ldots, t_{end}$ of a finite interaction sequence with the environment (an episode), $u^k(t)$ denotes the real-valued activation of $u^k$. The real-valued input vector $x(t)$ (which may include a unique encoding of the current task) has $n(x) \in \mathbb{N}$ components, where the $k$-th component is denoted $x^k(t)$; we define $u^k(t) = x^k(t)$ for $k = 1, 2, \ldots, n(x)$. That is, the first $n(x)$ neurons are input neurons; they do not have incoming connections from other neurons. The current reward signal $r(t)$ (if any) is a special real-valued input; we set the corresponding input component equal to $r(t)$. For $k = n(x) + 1, \ldots, n(x) + n(y)$, we set $y^{k - n(x)}(t) = u^k(t)$, thus defining the $n(y)$-dimensional output vector $y(t)$, which may affect the environment (e.g., by defining a robot action) and thus future $x$ and $r$. For $n(x) + 1 \le k \le n(u)$ we initialize $u^k(1) = 0$ and for $1 \le t < t_{end}$ compute $u^k(t+1) = f^k(\sum_l w^{lk} u^l(t))$ (if $u^k$ is user-defined as an additive neuron) or $u^k(t+1) = f^k(\prod_l w^{lk} u^l(t))$ (if $u^k$ is a multiplicative neuron).

Here the function $f^k$ maps $\mathbb{R}$ to $\mathbb{R}$. Many previous RNNs use differentiable activation functions such as $f^k(x) = 1/(1 + e^{-x})$, or $f^k(x) = x$. We want SLIM NN programs that can easily define their own size. Hence we focus on threshold activation functions that allow for keeping most units inactive most of the time, e.g.: $f^k(x) = 1$ if $x > 0.5$ and 0 otherwise. For the same reason we also consider winner-take-all activation functions. Here all non-input neurons (including output neurons) are partitioned into ordered winner-take-all subsets (WITAS), like in my first RNN from 1989 [?]. Once all $u^k(t+1)$ of a WITAS are computed as above, and at least one of them exceeds a threshold such as 0.5, and a particular $u^k$ is the first with maximal activation in its WITAS, then we re-define $u^k(t+1)$ as 1, otherwise as 0.
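The two activation rules just described can be sketched in a few lines of Python. The function names and the representation of a WITAS as an ordered list of values are assumptions made for illustration; the threshold 0.5 is the example value from the text.

```python
def threshold(x, theta=0.5):
    """Threshold activation: 1 if x exceeds theta, 0 otherwise."""
    return 1 if x > theta else 0


def witas_activations(values, theta=0.5):
    """Winner-take-all over one ordered WITAS.

    values -- activations of the subset's neurons, in their fixed order.
    The first neuron with maximal activation becomes 1, but only if that
    maximum exceeds theta; all other neurons of the subset become 0.
    """
    result = [0] * len(values)
    if values:
        best = max(values)
        if best > theta:
            result[values.index(best)] = 1   # index() returns the first maximum
    return result
```

For instance, `witas_activations([0.2, 0.9, 0.9])` activates only the second neuron, while a subset whose values all stay at or below the threshold remains entirely inactive.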

For each $c^{lk}$ there is a constant cost $cost^{lk}$ of using $c^{lk}$ between $t$ and $t + 1$, provided $w^{lk} u^l(t) \ne 0$. More on this in Section 3.4.

A special, unusual, non-traditional non-input neuron is called the halt neuron $u^{halt}$. If $u^{halt}$ is active (has non-zero activation) once all updates of time $t$ have been completed, we define $t_{end} := t$, and the computation stops. For non-halting programs, $t_{end}$ might be a maximal time limit $t_{lim}$ to be defined by a learning algorithm based on techniques of asymptotically optimal program search [41, 66, 58, 60]; see Section 3.2.

2.1 Efficient Activation Spreading and NN Resets 

Procedure Spread (inspired by an earlier RNN implementation [50]) efficiently implements episodes according to the formulae above (see Procedure 2.1). Each $u^l$ is associated with a list $out^l$ of all connections emanating from $u^l$. A nearly trivial observation is that only neurons with non-zero activation need to be considered for activation spreading. There are three global variable lists (initially empty): old, new, trace. Lists old and new track neurons used in the most recent two time steps, to efficiently proceed from one time step to the next; trace tracks connections used at least once during the current episode. For each $u^k$ there is a global Boolean variable $used^k$ (initially 0), to mark which RNN neurons already received contributions from the previous step during the current interaction sequence with the environment. For each $c^{lk}$ there is a global Boolean variable $mark^{lk}$ (initially 0), to mark which connections were used at least once. The following real-valued variables are initialized by 0 unless indicated otherwise: $u^k(now)$ holds the activation of $u^k$ at the current step, $next^k$ is a temporary variable for collecting contributions from neurons connected to $u^k$ (initialized by 1 if $u^k$ is a multiplicative neuron); $x^k(now)$ holds the current input of $u^k$ if $u^k$ is an input neuron. The integer variable time (initially 0) is used to count connection usages; the given time limit $t_{lim}$ will eventually stop episodes that are not halted by the halting unit. The label (*) in Spread will be referred to in Section 3 on learning. Spread's results include two global variables: the program trace and its runtime time.

Once Spread has finished, weights of connections in trace are the only used instructions of the SLIM 
program that just ran on the RNN. We observe: tracking and undoing the effects of a program essentially 
does not cost more than its execution, because untouched parts of the net are never considered for resets. 

Note the difference to most standard NN implementations: the latter use matrix multiplications to 
multiply entire weight matrices by activation vectors. The simple list-based method Spread, however, 
ignores all unused neurons and irrelevant connections. In large brain-like sparse nets this by itself may 
dramatically accelerate information processing. 

2.2 Relation to Traditional Self-Delimiting Programs and Prefix Codes 

Since the order of neuron activation updates between two successive time steps is irrelevant, such updates can be performed in parallel. That is, SLIM NN code can be executed in partly parallel and partly sequential fashion. Nevertheless, the execution follows basic principles of sequential SLIM programs [42, 3, 43, 56, ?, 30] (Section 1.2).

As mentioned in Section 1.2, the latter form a prefix code on program space. An equivalent condition holds for the traces computed by Spread in a resettable deterministic environment, as long as we identify a given trace with all possible trace variants (an equivalence class) reflecting the irrelevant order of neuron activation updates between two successive time steps. (In non-resettable environments, the environmental inputs have to be viewed as an additional part of the program to establish such a prefix code condition.) Compare also Section 3.1 on learning-based NN growth and the label (*) in Spread.
Procedure 2.1: Spread

(see text for global variables and their initialization before the first call of Spread)

set new := old := trace := nil
while $u^{halt}(now)$ < threshold do
    get next input vector $x(now)$
    for $k = 1, 2, \ldots, n(x)$ do
        $u^k(now) := x^k(now)$; if $u^k(now) \ne 0$ append $u^k$ to old
    end for
    for all $u^l \in$ old do
        for all $c^{lk} \in out^l$ do
            (*) [If $c^{lk}$ was never used before in the current or any previous episode, a learning algorithm (see Section 3) may set $w^{lk} \ne 0$ for the first time, thus growing the effectively used SLIM RNN by $c^{lk}$ (and by $u^k$ in case $u^k$ was never used before)]
            if $w^{lk} \ne 0$ then
                if $mark^{lk} = 0$ then set $mark^{lk} := 1$ and append $c^{lk}$ to trace
                if $u^k$ is additive then $next^k := next^k + u^l(now) w^{lk}$
                else $u^k$ is multiplicative and $next^k := next^k \, u^l(now) w^{lk}$
                if $used^k = 0$ then set $used^k := 1$ and append $u^k$ to new
                time := time + $cost^{lk}$ [long wires may cost more; see Section 3.4]
                (**) if time > $t_{lim}$ then exit while loop
            end if
        end for
    end for
    for $u^k \in$ new do
        determine final new activation $u^k(now)$ (either 1 or 0) through thresholding and determination of WITAS winners (if any; see Section 2)
        $used^k := 0$; if $u^k$ is additive then $next^k := 0$ else $next^k := 1$ [restore]
    end for
    old := new; new := nil [now old cannot contain any input units]
    delete from old all $u^k$ with zero $u^k(now)$
    execute environment-changing actions (if any) based on output neurons; possibly update problem-specific variables needed for an ongoing performance evaluation according to a given problem-specific objective function [2] (see Section 3 on learning); continually add the computational costs of the above to time; once time > $t_{lim}$ exit while loop
end while
for $u^k \in$ new [perhaps new $\ne$ nil in case of premature exit from (**)] do
    $used^k := 0$; if $u^k$ is additive then $next^k := 0$ else $next^k := 1$ [restore]
end for
for $c^{lk} \in$ trace do
    $mark^{lk} := 0$
end for
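
As a rough illustration of the list-based idea behind Spread, the following Python sketch handles a deliberately restricted case: additive threshold neurons only, one fixed input vector presented at every step, unit cost per connection usage, no WITAS, and no environment. These restrictions, the dictionary-based data layout, and all names are assumptions of this sketch, not part of the procedure above. It keeps the key property that only active neurons and their outgoing connections are ever touched.

```python
from collections import defaultdict


def spread(inputs, out, weights, halt, t_lim, theta=0.5):
    """Simplified activation spreading (sketch under the assumptions above).

    inputs  -- dict: input neuron -> activation, presented at every step
    out     -- dict: neuron -> list of target neurons
    weights -- dict: (source, target) -> weight; a missing entry means unused
    Returns (trace, time, halted).
    """
    trace, marked, time = [], set(), 0
    old = []                                   # active non-input neurons
    while True:
        active = [(k, x) for k, x in inputs.items() if x != 0]
        active += [(k, 1.0) for k in old]      # threshold units are 0 or 1
        nxt = defaultdict(float)               # entries only for touched neurons
        for src, act in active:
            for dst in out.get(src, ()):
                w = weights.get((src, dst), 0.0)
                if w == 0:
                    continue
                if (src, dst) not in marked:   # first usage in this episode
                    marked.add((src, dst))
                    trace.append((src, dst))
                nxt[dst] += act * w
                time += 1                      # unit cost per connection usage
                if time > t_lim:
                    return trace, time, False
        if not nxt:                            # no connection used: nothing can change
            return trace, time, False
        old = [k for k, s in nxt.items() if s > theta]
        if halt in old:
            return trace, time, True
```

In a small chain `0 -> 1 -> 2 -> 3` with neuron 3 as the halt neuron, a connection leaving a never-activated neuron 4 does not appear in the trace and costs no time, which is the point of the list-based scheme.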
3 Principles of Efficient Learning Algorithms (LAs) for SLIM NNs



Through weight changes, the NN is supposed to learn something from a sequence of training task descriptions $T_1, T_2, \ldots$. Here each unique $T_i \in \mathbb{R}^{n(T_i)}$ ($i = 1, 2, \ldots$) could identify a pattern classification task or robot control task, where the task description dimensionality $n(T_i)$ is an integer constant, such that (parts of) $T_i$ can be used as a non-changing part of the inputs $x(t)$. The SLIM NN's performance on each task is measured by some given problem-specific objective function [2].

To efficiently change SLIM NN weights based on experience, any learning algorithm (LA) should 
ignore all unused weights. 

Since prefixes of SLIM programs influence their suffixes, and weights used early in a Spread episode influence which weights are considered later, weight modifications tested by SLIM NN LAs should be generated online during program execution, such that unused weights are not even considered as candidates for change. Search spaces of many well-known LAs (such as hill climbing and neuro-evolution; see Section 1.1) obviously may be greatly reduced by obeying these restrictions.

3.1 LA-Based SLIM NN Growth 

Typical SLIM NN LAs (e.g., Section 3.2) will influence how SLIM NNs grow. Consider the bracketed statement in procedure Spread labeled by (*). If some $c^{lk}$ considered here was never used before, and its $w^{lk}$ never defined, a tentative value $w^{lk} \ne 0$ can be temporarily set here (setting $w^{lk} = 0$ wouldn't have any effect), and the used part of the net effectively grows by $c^{lk}$ (and $u^k$ in case $u^k$ also was never used before). Later performance evaluations may suggest making this extended topology permanent and keeping $w^{lk}$ as a basis for further changes.

This type of SLIM program-directed NN growth is quite different from previous popular NN growth strategies, e.g., [?].

3.2 (Incremental Adaptive) Universal Search for SLIM NNs 

LAs for growing SLIM NNs as in Section 3.1 may be based on techniques of asymptotically optimal program search [41, 66, 58, 60]. Assume some initial bias in form of probability distributions $P^{lk}$ on a finite set $V = \{v_1, v_2, \ldots, v_m\}$ of possible real-valued values for each $w^{lk}$. Let $n(c^{lk})$ denote the number of usages of $c^{lk}$ during Spread. Given some task, one of the simplest LAs based on universal search [41] is the algorithm Universal SLIM NN Search.



Universal SLIM NN Search (Variant 1)
for $i := 1, 2, \ldots$ do
    systematically enumerate and test possible programs trace (as computed by Spread) with runtime
        $ExternalCosts(trace) + \sum_{c^{lk} \in trace} cost^{lk} \, n(c^{lk}) \le 2^i \prod_{c^{lk} \in trace} P^{lk}(w^{lk})$
    until all have been tested, or the most recently tested trace has solved the task and the solution has been verified; in the latter case exit and return that trace
end for



Here the real-valued expression $ExternalCosts(trace)$ represents all costs other than those of the NN's connection usages. This includes the costs of output actions and evaluations. $ExternalCosts(trace)$ may be negligible in many applications though. The left-hand side of the inequality in Universal SLIM NN Search is essentially the time computed by Spread.

That is, Universal SLIM NN Search time-shares all program tests such that each program trace gets not more than a constant fraction of the total search time. This fraction is proportional to its probability. The method is near-bias-optimal [60] and asymptotically optimal in the following sense: If some unknown trace requires at most $f(k)$ steps to solve a problem of a given class and integer size $k$ and verify the solution, where $f$ is a computable function mapping integers to integers, then the entire search also will need at most $O(f(k))$ steps.
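
The time-sharing scheme can be sketched in Python in a heavily simplified setting. Assumptions of this sketch: a program is just a tuple of weight values drawn from a finite set, each value has a fixed prior probability, and a caller-supplied function `run(program, budget)` (hypothetical) reports `"solved"`, `"needs_more"`, `"failed"`, or `"timeout"`. Unlike an efficient implementation, the sketch re-runs each prefix from scratch instead of backtracking.

```python
def universal_search(values, prob, run, max_phase=30):
    """Phase i grants each program at most 2**i * P(program) time steps."""
    for i in range(1, max_phase + 1):
        stack = [((), 1.0)]                    # (program prefix, its probability)
        while stack:
            program, p = stack.pop()
            status = run(program, (2 ** i) * p)
            if status == "solved":
                return program, i
            if status == "needs_more":         # extend only where execution asks for it
                for v in values:
                    stack.append((program + (v,), p * prob[v]))
    return None, max_phase


def make_toy_run(target):
    """Toy task (assumption): reproduce `target`; each weight costs one time step."""
    def run(program, budget):
        if len(program) > budget:
            return "timeout"
        if program == target:
            return "solved"
        if target[:len(program)] == program:
            return "needs_more"
        return "failed"
    return run
```

With two equiprobable weight values and a target of length 3, the target has probability 1/8 and needs 3 steps, so it first fits into the budget in phase 5, where $2^5 / 8 = 4 \ge 3$.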

To explore the space of possible traces and their computational effects, efficient implementations of Universal SLIM NN Search use depth-first search in program prefix space combined with stack-based backtracking for partial state resets, like in the online source code [59] of the Optimal Ordered Problem Solver OOPS [60].

Traditional NN LAs address overfitting by pre-wired regularizers and hyper-parameters [2] to penalize NN complexity. Universal SLIM NN Search, however, systematically tests programs that essentially select their own task-dependent size (the number of weights = modifiable free parameters). It favors SLIM NNs that combine short runtime and simplicity/low descriptive complexity. Note that small size or low description length is equivalent to high probability, since the negative binary logarithm of the probability of some SLIM NN's trace is essentially the number of bits needed to encode trace by Huffman coding [29]. Hence the method has a built-in way of addressing overfitting and boosting generalization performance [54, 55] through a bias towards simple solutions in the sense of Occam's razor [?, 34, 43, 30].

The method can be extended [66, 60] such that it incrementally solves each problem in an ordered sequence of problems, continually organizing and managing and reusing earlier acquired knowledge. For example, this can be done by updating the probability distributions $P^{lk}$ based on success: once Universal SLIM NN Search has found a solution trace to the present problem, some (possibly heuristic) strategy is used to shift the bias by increasing/decreasing the probabilities of weights of connections in trace before the next invocation of Universal SLIM NN Search on the next problem [66]. Roughly speaking, each doubling of trace's probability halves the time needed by Universal SLIM NN Search to find trace.

One of the simplest bias-shifting procedures is Adaptive Universal SLIM NN Search (Variant 1) based on earlier work on sequential programs [66]. It uses a constant learning rate $\eta \in \mathbb{R}$; $0 < \eta < 1$. After a successful episode with a halting program, for each $c^{lk}$ with $n(c^{lk}) > 0$, let $yes^{lk}$ denote how often successive activations $u^l(t)$ and $u^k(t+1)$ ($t < t_{end}$) were both 1, and let $no^{lk}$ denote how often $u^l(t)$ was 1 but $u^k(t+1)$ was 0. Note that $n(c^{lk}) = yes^{lk} + no^{lk}$. Define $-1 \le \Delta^{lk} := (yes^{lk} - no^{lk}) / n(c^{lk}) \le 1$. The sign of $\Delta^{lk}$ indicates whether $u^l$ usually helped to trigger or suppress $u^k$. A Hebb-inspired learning rule uses $\Delta^{lk}$ to change $P^{lk}$ in case of success.



Adaptive Universal SLIM NN Search (Variant 1)
for $i := 1, 2, \ldots$ do
    use Universal SLIM NN Search to solve the $i$-th problem by some solution program trace
    for all $l$ satisfying $c^{lk} \in$ trace do
        for all $c^{lk} \in out^l$ do
            if $\Delta^{lk} < 0$ then
                $P^{lk}(w^{lk}) := P^{lk}(w^{lk}) + \eta \Delta^{lk} P^{lk}(w^{lk})$ [decrease $P^{lk}(w^{lk})$]
            else
                $P^{lk}(w^{lk}) := P^{lk}(w^{lk}) + \eta \Delta^{lk} (1 - P^{lk}(w^{lk}))$ [decrease $1 - P^{lk}(w^{lk})$]
            end if
            for all $v \in V$, $v \ne w^{lk}$ do
                normalize: $P^{lk}(v) := \gamma P^{lk}(v)$, where constant $\gamma \in \mathbb{R}$ is chosen to ensure $\sum_j P^{lk}(v_j) = 1$
            end for
        end for
    end for
end for
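
The inner update for a single connection can be sketched in Python as follows. The dictionary representation of $P^{lk}$, the function names, and the uniform fallback for the degenerate case in which all other values have probability 0 are assumptions of this sketch; the update formulas themselves follow the procedure above.

```python
def delta(yes, no):
    """Correlation measure (yes - no) / n for one used connection (n = yes + no > 0)."""
    return (yes - no) / (yes + no)


def shift_bias(P, w, d, eta):
    """Return an updated copy of the distribution P (dict: value -> probability).

    w   -- the weight value used by the successful trace
    d   -- the connection's delta in [-1, 1]
    eta -- learning rate, 0 < eta < 1
    Assumes the probabilities in P sum to 1.
    """
    old = P[w]
    if d < 0:
        new = old + eta * d * old              # decrease P(w)
    else:
        new = old + eta * d * (1.0 - old)      # decrease 1 - P(w)
    rest = 1.0 - old                           # mass of all other values before the update
    others = [v for v in P if v != w]
    updated = {w: new}
    for v in others:
        if rest > 0:
            updated[v] = P[v] * (1.0 - new) / rest   # rescale by a common factor
        else:
            updated[v] = (1.0 - new) / len(others)   # fallback chosen for this sketch
    return updated
```

Rescaling all other values by the common factor $(1 - P_{new}(w)) / (1 - P_{old}(w))$ is one way to realize the normalization constant $\gamma$.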



To reduce the search space, alternative (adaptive) Universal SLIM NN Search variants do not use independent $P^{lk}$ but joint distributions $P^l$ for each $out^l$ to correlate various $P^{lk}$. For example, the rule may be: exactly one of the connections $c^{lk} \in out^l$ must have a weight of 1, all others must have -1. (This is inspired by biological brains whose connections are mostly inhibitory.) The initial $P^l$ may assign equal a priori probability to the possible weight vectors (as many as there are connections in $out^l$).

Yet additional variants of adaptive universal search for low-complexity networks search in compressed network space [36, 35, ?]. Alternatively, apply the principles of the Optimal Ordered Problem Solver OOPS [60, 58] to SLIM NNs: If a new problem can be solved faster by writing a program that invokes previously found code than by solving the new problem from scratch, then OOPS will find this out.
3.3 Tracking which Connections Affect which Tasks



To efficiently exploit that possibly many weights are not used by many tasks, we keep track of which connections affect which tasks [62]. For each connection $c^{lk}$ a variable list $L^{lk} = (T^{lk}_1, T^{lk}_2, \ldots)$ of tasks is introduced. Its initial value before learning is $L^{lk}_0$, an empty list.

Let us now assume tentative changes of certain used $w^{lk}$ are computed by an LA embedded within a Spread-like framework (Section 2.1); compare label (*) in Spread. That is, some of the used weights (but no unused weights!) are modified or generated through an LA, while it is becoming clear during the ongoing activation spreading computation which units and connections are used at all; compare Sections 3.1 and 3.2.

Now note that the union $L$ of the corresponding $L^{lk}$ is the list of tasks on which the SLIM NN's performance may have changed through the weight modifications. All $T \notin L$ can be safely ignored; performance on those tasks remains unaffected. For $T \in L$ we use Spread to re-evaluate performance. If total performance on all $T \in L$ has not improved through the tentative weight changes, the latter are undone. Otherwise we keep them, and all affected $L^{lk}$ are updated as follows (using the traces computed by Spread): the new value $L^{lk}_i$ is obtained by appending to $L^{lk}_{i-1}$ those $T_j \notin L^{lk}_{i-1}$ ($j = 1, \ldots, i$) whose current (possibly revised) solutions now need $w^{lk}$ at least once during the solution-computing process, and deleting those $T_j$ whose current solutions do not use $w^{lk}$ any more.

That is, if the most recent task does not require changes of many weights, and if the changed weights do
not affect many previous tasks, then validation of learning progress through methods like those of Section
3.2 or similar [62] may be much more efficient than in traditional NN implementations.
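
The bookkeeping of this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the author's implementation: connections are represented as hypothetical `(l, k)` tuples, `task_lists` maps each connection to its list of tasks, and `traces` maps each re-evaluated task to the set of connections its current solution used at least once (as a Spread-like run would record).

```python
def affected_tasks(task_lists, changed):
    """Union L of the task lists of all changed connections."""
    affected = set()
    for conn in changed:
        affected.update(task_lists.get(conn, ()))
    return affected

def update_task_lists(task_lists, traces):
    """Refresh the per-connection task lists after accepted weight changes."""
    for task, used in traces.items():
        # Delete the task from connections its current solution no longer uses.
        for conn, tasks in task_lists.items():
            if task in tasks and conn not in used:
                tasks.remove(task)
        # Append the task to connections it now uses but was not listed for.
        for conn in used:
            tasks = task_lists.setdefault(conn, [])
            if task not in tasks:
                tasks.append(task)
    return task_lists

# Hypothetical example: three connections, three previously learned tasks.
task_lists = {("a", "b"): ["T1", "T2"], ("b", "c"): ["T2"], ("c", "d"): ["T3"]}
changed = [("a", "b")]
to_check = affected_tasks(task_lists, changed)  # only these need re-evaluation

# Assumed traces obtained by re-running the tasks in to_check.
traces = {"T1": {("a", "b")}, "T2": {("b", "c"), ("c", "d")}}
update_task_lists(task_lists, traces)
```

In this example only T1 and T2 are re-evaluated after the change to the weight of connection `("a", "b")`; T3 is never touched because its solution does not depend on the changed weight.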

3.4 Additional LA Principles for SLIM NNs on Future 3D Hardware 

Computers keep getting faster per cost. To continue this trend within the limitations of physics, future 
hardware architectures will feature 3-dimensional arrangements of numerous connected processors. To 
minimize wire length and communication costs PR)I, the processors should communicate through many
low-cost short-range and few high-cost long-range connections, much like biological neurons. Given some 
task, to minimize energy consumption and cooling costs, no more processors or neurons than necessary to 
solve the task should become active, and those that communicate a lot with each other should typically be 
physically close. 

All of this can be encouraged through LAs that punish excessive processing and communication costs 
of 3D SLIM NNs running on such hardware. 

Consider the constant $cost^{lk}$ of using $c^{lk}$ in such a 3D SLIM RNN from one discrete time step to the
next in Spread-like procedures. $cost^{lk}$ may be viewed as the wire length of $c^{lk}$ [40]. The expression
$\sum_{l,k} cost^{lk}\, n(c^{lk})$ can enter the objective function, e.g., as an additive term to be minimized by an LA like
those mentioned in Section 3. Note, however, that such costs are automatically taken into account by the
universal program search methods of Section 3.2.
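
A small Python sketch of such a cost term follows. It rests on assumptions not stated in this section: each neuron has a hypothetical 3D position, the wire length of a connection is taken to be the Euclidean distance between its endpoints, and the count associated with a connection is taken to be the number of times it was used during an episode.

```python
import math

def wire_length(pos_l, pos_k):
    """Euclidean distance between two neuron positions (assumed cost model)."""
    return math.dist(pos_l, pos_k)

def communication_cost(positions, usage_counts):
    """Sum over used connections of wire length times number of uses."""
    total = 0.0
    for (l, k), n_uses in usage_counts.items():
        total += wire_length(positions[l], positions[k]) * n_uses
    return total

# Hypothetical layout: neuron 1 is close to neuron 0, neuron 2 is far away.
positions = {0: (0, 0, 0), 1: (1, 0, 0), 2: (0, 3, 4)}
usage_counts = {(0, 1): 3, (0, 2): 1}  # only used connections appear here

cost = communication_cost(positions, usage_counts)  # 3 * 1.0 + 1 * 5.0
```

Because only used connections contribute, the term can be computed from the trace of an episode without visiting unused weights.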

Like biological brains, typical 3D SLIM RNNs will have many more short wires than long ones. An
automatic by-product of LAs as in Section 3.2 should be the learning of subtask solutions by subsets of
neurons most of which are physically close to each other. The resulting weight matrices may sometimes
be reminiscent of self-organizing maps for pattern classification [33] and motor control [24], [46]. The
underlying cause of such neighborhood-preserving weight matrices, however, will not be a traditional
pre-wired neighborhood-enforcing learning rule [33], [46], but sheer efficiency per se.

4 Experiments 

First experiments with SLIM NNs are currently conducted within the recent PowerPlay framework [62].
PowerPlay is designed to learn a more and more general problem solver from scratch. The idea is to let
a general computer (say, a SLIM RNN) solve more and more tasks from the infinite set of all computable
tasks, without ever forgetting solutions to previously solved tasks. At a given time, which task should
be posed next? Human teachers in general do not know which tasks are not yet solvable by the SLIM
RNN through generalization, yet easy to learn, given what's already known. That's why PowerPlay
continually invents the easiest-to-add new task by itself.

To do this, PowerPlay incrementally searches the space of possible pairs of (1) new tasks, and 
(2) SLIM RNN modifications. The search continues until the first pair is discovered for which (a) the 
current SLIM RNN cannot solve the new task, and (b) the new SLIM RNN provably solves all previously 
learned tasks plus the new one. Here the new task may actually be to achieve a wow-effect by simplifying, 
compressing, or speeding up previous solutions. 

Given a SLIM RNN that can already solve a finite known set of previously learned tasks, a particular 
AOPS algorithm [62] (compare Section 3.2) can be used to find a new pair that provably has properties (a)
and (b). Once such a pair is found, the cycle repeats itself. This results in a continually growing set of tasks 
solvable by an increasingly more powerful solver. The continually increasing repertoire of self-invented 
problem-solving procedures can be exploited at any time to solve externally posed tasks. 
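
The search for a pair with properties (a) and (b) can be sketched in Python as below. This is a toy illustration only, not the PowerPlay algorithm of [62]: the candidate tasks and modifications are assumed to be supplied in the order in which they should be tried, validation is done by plain re-testing rather than proof, and `solves` and `apply_mod` are hypothetical callables standing in for running and modifying a SLIM RNN.

```python
from itertools import product

def powerplay_step(solver, solved, candidate_tasks, candidate_mods, solves, apply_mod):
    """Return the first (task, new_solver) pair satisfying (a) and (b), or None."""
    for task, mod in product(candidate_tasks, candidate_mods):
        if solves(solver, task):
            continue  # (a) violated: the current solver already solves the task
        new_solver = apply_mod(solver, mod)
        if all(solves(new_solver, t) for t in solved + [task]):
            return task, new_solver  # (b) holds: old tasks plus the new one
    return None

# Toy stand-in: the "solver" is a lookup table, and task t is solved
# if the table maps t to t squared.
def solves(solver, task):
    return solver.get(task) == task * task

def apply_mod(solver, mod):
    key, value = mod
    return {**solver, key: value}

solver, solved = {1: 1}, [1]
candidate_tasks = [1, 2, 3]
candidate_mods = [(k, v) for k in (1, 2, 3) for v in (1, 4, 9)]

result = powerplay_step(solver, solved, candidate_tasks, candidate_mods, solves, apply_mod)
```

Here the first accepted pair is task 2 together with the modification that adds the entry `2: 4`. Modifications that would break the solution of task 1 are rejected by the check for (b).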

How to represent tasks of the SLIM RNN? A unique task index is given as a constant RNN input in 
addition to the changing inputs from the environment manipulated by the RNN outputs. Once the halt
unit gets activated and the computation ends, the activations of a special pre-defined set of internal neurons
can be viewed as the result of the computation. Essentially arbitrary computable tasks can be represented 
in this way by the SLIM RNN. 

We can keep track of which tasks are dependent on each connection (Section 3.3). If the most recent
task to be learned does not require changes in many weights, and if the changed weights do not affect many
previous tasks, then validation may be very efficient. Now recall that PowerPlay prefers to invent tasks
whose validity check requires little computational effort. This implicit incentive (to generate modifications
that do not impact many previous tasks) leads to a natural decomposition of the space of tasks and their
solutions into more or less independent regions. Thus, divide and conquer strategies are natural by-products
of PowerPlay-trained SLIM NNs [62]. Experimental results will be reported in separate papers.

5 Conclusion 

Typical recurrent self-delimiting (SLIM) neural networks (NNs) are general computers for running
arbitrary self-delimiting parallel-sequential programs encoded in their weights. While certain types of SLIM
NNs have been around for decades, e.g., fSOl, little attention has been given to certain fundamental benefits
of their self-delimiting nature. During program execution, lists or stacks can be used to trace only those
neurons and connections used at least once. This also allows for efficient resets of large NNs which may
use only a small fraction of their weights per task. Efficient SLIM NN learning algorithms (LAs) track
which weights are used for which tasks, to greatly speed up performance evaluations in response to
limited weight changes. SLIM NNs are easily combined with techniques of asymptotically optimal program
search. To address overfitting, instead of depending on pre-wired regularizers and hyper-parameters [2],
SLIM NNs can in principle learn to select by themselves their own runtime and their own numbers of free
parameters, becoming fast and slim when necessary. LAs may penalize the task-specific total length of
connections used by SLIM NNs implemented on the 3-dimensional brain-like multi-processor hardware to
be expected in the future. This should encourage SLIM NNs to solve many subtasks by subsets of neurons
that are physically close. Ongoing experiments with SLIM RNNs will be reported separately.